[CI/Build] Bump flashinfer to v0.6.10#41711
Conversation
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
|
Hi @arpera, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
|
Code Review
This pull request updates the FlashInfer version to 0.6.10 across the project's Docker configurations and dependency files. It also introduces conditional logic in the Dockerfile and setup.py to include the [cu13] extra for flashinfer-python when CUDA 13 is detected, facilitating support for SM100 GDN kernels. I have no feedback to provide.
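For context, a minimal sketch of what such conditional logic could look like on the setup.py side (the environment-variable check and function names below are illustrative assumptions, not the actual vLLM implementation):
```python
# Hypothetical sketch: choose the flashinfer-python requirement based on the
# CUDA major version the build targets.
import os


def cuda_major_version() -> int:
    # Assumption: the CUDA version is available via an env var like CUDA_VERSION
    # ("13.0"); a real build may instead query torch or nvcc.
    return int(os.environ.get("CUDA_VERSION", "12.8").split(".")[0])


def flashinfer_requirement(version: str = "0.6.10") -> str:
    # CUDA 13 builds pull in the [cu13] extra (needed for SM100 GDN kernels);
    # older CUDA versions keep the plain package.
    if cuda_major_version() >= 13:
        return f"flashinfer-python[cu13]=={version}"
    return f"flashinfer-python=={version}"
```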
|
FYI: 0.6.9 update - #40998 |
|
Yes, I have seen this PR #40998, thanks. It wasn't finished, so I think now v0.6.10 makes more sense. |
|
I would also like to point out that in this PR, in addition to directly integrating the new FI version v0.6.8, I made a small fix that wasn't accounted for in vLLM when integrating previous FI versions. There is also a small discussion about this issue in comments: 1, 2. Since I don't have much experience managing build dependencies in vLLM, I'd be happy to get suggestions for a more correct way to handle this in vLLM. |
|
I am noticing some potential numeric issues with the newer flashinfer versions. Specifically, the generation length for GPQA with DSv4 is significantly longer with the new versions than before (Claude suggests the model is stuck in a self-doubt loop). I am still investigating the issue, but just wanted to flag this. It may be worth doing some more eval studies before merging this. |
|
Do I understand correctly that if I have an environment with cu13 and do |
|
Yes, you understand right |
|
@arpera With more investigation, I think the issue I was hitting was not related to the newer flashinfer versions (but to something else). I tested the v0.6.10 GPQA eval with DeepSeek v4, and it looks good. I have no more concerns about upgrading. |
|
Side note: we released 0.6.10.post1 not long ago to fix an allreduce hang caused by a missing rendezvous sync group
|
@aleozlx, that is good to know, thank you! |
|
0.6.11 is out cc @mgoin |
## 📌 Description
`gen_jit_spec` adds `-DNDEBUG` only to `extra_cuda_cflags` (consumed by
`nvcc` for `.cu` files), not to `extra_cflags` (consumed by `g++` for
host-side `.cpp`). Several host-only translation units are part of
MoE/GEMM JIT specs — most notably
`csrc/nv_internal/cpp/common/logger.cpp` — and they end up compiled
without `NDEBUG` while the rest of the module is a release build.
For the TensorRT-LLM logger this matters because of:
```cpp
// csrc/nv_internal/include/tensorrt_llm/common/logger.h
#ifndef NDEBUG
Level const DEFAULT_LOG_LEVEL = DEBUG;
#else
Level const DEFAULT_LOG_LEVEL = INFO;
#endif
```
With `NDEBUG` missing on the host side, every prebuilt
`flashinfer-jit-cache` wheel ships with `Logger::level_ = DEBUG (10)`.
On Hopper this turns each MoE forward pass into a stream of
`[TensorRT-LLM][DEBUG] ... sm90_generic_mixed_moe_gemm_kernelLauncher
...` lines from the OSS CUTLASS kernel dispatcher. Verified by reading
the data-section initializer of `Logger::Logger()` in the released
`flashinfer-jit-cache==0.6.10+cu130`
`fused_moe_{90,100,103,120,trtllm_sm100}.so` — all five start `Logger`
with `DEFAULT_LOG_LEVEL=10` and `level_=10`, even though the same wheels
carry no `.debug_*` sections (i.e. they are otherwise release-built).
The fix is one line: also append `-DNDEBUG` to the host `cflags` when
not in debug mode. The `flashinfer-jit-cache` wheel build picks this up
automatically and the prebuilt logger flips back to `INFO`.
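For illustration, a minimal sketch of the intended behavior in `flashinfer/jit/core.py` (the function name and signature below are assumptions made for the sketch; only the idea of appending `-DNDEBUG` to both flag lists in release mode comes from this PR):
```python
# Sketch: in release mode, define NDEBUG for host (g++) compiles as well,
# not only for device (nvcc) compiles.
import os


def build_jit_flags(user_cflags=(), user_cuda_cflags=(), debug=None):
    if debug is None:
        # FLASHINFER_JIT_DEBUG=1 selects a debug build (see the tests below).
        debug = os.environ.get("FLASHINFER_JIT_DEBUG", "0") == "1"

    extra_cflags = list(user_cflags)            # g++ flags for host-side .cpp files
    extra_cuda_cflags = list(user_cuda_cflags)  # nvcc flags for .cu files

    if not debug:
        extra_cuda_cflags.append("-DNDEBUG")  # already happened before this change
        extra_cflags.append("-DNDEBUG")       # the fix: logger.cpp now sees NDEBUG too
    return extra_cflags, extra_cuda_cflags
```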
## 🔍 Related Issues
Initially this bug was observed during integration of FI v0.6.10 into
vLLM: [[CI/Build] Bump flashinfer to v0.6.10
#41711](vllm-project/vllm#41711).
There is a CI job log failure due to this issue:
[buildkite/ci/pr/distributed-tests-2-gpus-h100](https://buildkite.com/vllm/ci/builds/64532#019df966-e67d-4c27-af0e-76b00bc496e5).
Surfaced while debugging a downstream CI step that produced a 2.9 GB log
dominated by TRT-LLM debug prints from `fused_moe_90.so`. No FlashInfer
issue tracking this yet — happy to file one alongside this PR if useful.
## 🚀 Pull Request Checklist
### ✅ Pre-commit Checks
- [x] I have installed `pre-commit` by running `pip install pre-commit`.
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.
## 🧪 Tests
- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`pytest tests/test_jit_cpp_ext.py`).
Two regression tests added in `tests/test_jit_cpp_ext.py`, mirroring the
existing `test_debug_jit_uses_sccache_compatible_nvcc_device_debug_flag`
style:
```
pytest tests/test_jit_cpp_ext.py -v
```
```
test_release_jit_propagates_ndebug_to_host_cflags PASSED
test_debug_jit_does_not_propagate_ndebug PASSED
```
The first asserts that a release build
(`FLASHINFER_JIT_DEBUG`/`FLASHINFER_JIT_VERBOSE` unset) puts `-DNDEBUG`
in **both** `spec.extra_cflags` and `spec.extra_cuda_cflags`. The second
locks in symmetry: with `FLASHINFER_JIT_DEBUG=1` neither list contains
`-DNDEBUG`. Without the fix, the first test fails on `assert "-DNDEBUG"
in spec.extra_cflags`.
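Conceptually, the two assertions reduce to something like the following sketch, written against the hypothetical `build_jit_flags` helper above rather than the real `gen_jit_spec` call, whose full argument list is omitted here:
```python
# Hypothetical pytest sketch of the two regression checks described above.
# Assumes build_jit_flags from the earlier sketch is importable as shown.
from flags_sketch import build_jit_flags  # hypothetical module name


def test_release_jit_propagates_ndebug_to_host_cflags(monkeypatch):
    # Release mode: FLASHINFER_JIT_DEBUG unset -> NDEBUG on both flag lists.
    monkeypatch.delenv("FLASHINFER_JIT_DEBUG", raising=False)
    cflags, cuda_cflags = build_jit_flags()
    assert "-DNDEBUG" in cflags       # host side (this is the fix)
    assert "-DNDEBUG" in cuda_cflags  # device side (true before the fix as well)


def test_debug_jit_does_not_propagate_ndebug(monkeypatch):
    # Debug mode: neither flag list should define NDEBUG.
    monkeypatch.setenv("FLASHINFER_JIT_DEBUG", "1")
    cflags, cuda_cflags = build_jit_flags()
    assert "-DNDEBUG" not in cflags
    assert "-DNDEBUG" not in cuda_cflags
```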
## Reviewer Notes
Single-line behavior change in `flashinfer/jit/core.py`. No effect on
debug builds. Prebuilt wheels rebuilt from this commit will pick up the
change automatically — no schema/version bump needed.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
  * JIT-compiled code now includes optimized compilation flags in release mode for improved performance.
* **Tests**
  * Added test coverage for proper compilation flag handling between debug and release build modes.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
|
The current up-to-date FI version (v0.6.11) has a problem: some of the kernels were compiled with DEFAULT_LOG_LEVEL=DEBUG. Because of this, some of the tests in our CI run failed, for example buildkite/ci/pr/gpqa-eval-gpt-oss-h100. I managed to fix that issue on FI's side and merged the fix upstream: fix(jit): propagate -DNDEBUG to host-side cflags #3278. That leaves a question I would like to ask you. As I understand it, we now have two options.
What do you think? Should we wait until the next release, or should we apply a temporary solution in vLLM and remove it later? |
|
@arpera Can you request a patched release 0.6.11.post1 with the fix you just merged? That should be a reasonable thing to ask. Also, is the failure caused by the log level itself, or by some other issue that was hard to check due to the verbose log level? |
Thanks for the suggestion! I will ask the FI team to do this.
It was hard to check whether there were other issues besides that one because the logs were several GB each. I see that some of the failed CI jobs are flaky and did not fail because of the update. Nevertheless, there is still a possibility that the FI version update caused something else in our CI. |
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Purpose
Bump FlashInfer from v0.6.8.post1 to v0.6.10. Add the flashinfer-python[cu13] extra for cu13 users.